1 The KScalar simulator

J. C. Moure, Dolores I. Rexachs, Emilio Luque 

March 2002 Journal on Educational Resources in Computing (JERIC), volume 2 issue 1 
Full text available: pdf(493.35 KB) Additional Information: full citation, abstract, references, index terms

Modern processors increase their performance with complex microarchitectural 
mechanisms, which makes them more and more difficult to understand and evaluate. 
KScalar is a graphical simulation tool that facilitates the study of such processors. It allows
students to analyze the performance behavior of a wide range of processor 
microarchitectures: from a very simple in-order, scalar pipeline, to a detailed out-of-order, 
superscalar pipeline with non-blocking caches, speculative execution, and comp ... 



Keywords: Education, pipelined processor simulator 



2 Speculative dynamic vectorization

Alex Pajuelo, Antonio Gonzalez, Mateo Valero 

May 2002 ACM SIGARCH Computer Architecture News, Volume 30 issue 2 
Full text available: pdf(1.00 MB), Publisher Site Additional Information: full citation, abstract, references, index terms

Traditional vector architectures have shown to be very effective for regular codes where the 
compiler can detect data-level parallelism. However, this SIMD parallelism is also present in 
irregular or pointer-rich codes, for which the compiler is quite limited to discover it. In this 
paper we propose a microarchitecture extension in order to exploit SIMD parallelism in a
speculative way. The idea is to predict when certain operations are likely to be vectorizable,
based on some previous history i ... 

Keywords: Speculative dynamic vectorization, wide buses, speculative data computation, 
control independence, vector instructions 



3 System support for pervasive applications

Robert Grimm, Janet Davis, Eric Lemar, Adam Macbeth, Steven Swanson, Thomas Anderson,
Brian Bershad, Gaetano Borriello, Steven Gribble, David Wetherall

November 2004 ACM Transactions on Computer Systems (TOCS), volume 22 issue 4
Full text available: pdf(1.82 MB) Additional Information: full citation, abstract, references, index terms

Pervasive computing provides an attractive vision for the future of computing.
Computational power will be available everywhere. Mobile and stationary devices will
dynamically connect and coordinate to seamlessly help people in accomplishing their tasks. 
For this vision to become a reality, developers must build applications that constantly adapt 
to a highly dynamic computing environment. To make the developers' task feasible, we 
present a system architecture for pervasive computing, called ...

Keywords: Asynchronous events, checkpointing, discovery, logic/operation pattern, 
migration, one.world, pervasive computing, structured I/O, tuples, ubiquitous computing



4 Algorithms for scalable synchronization on shared-memory multiprocessors
John M. Mellor-Crummey, Michael L. Scott 

February 1991 ACM Transactions on Computer Systems (TOCS), volume 9 issue 1
Full text available: pdf(3.07 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in 
shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting 
tend to produce large amounts of memory and interconnect contention, introducing 
performance bottlenecks that become markedly more pronounced as applications scale. We 
argue that this problem is not fundamental, and that one can in fact construct busy-wait 
synchronization algorithms that induce no memory or interc ... 

5 Tarantula: a vector extension to the alpha architecture

Roger Espasa, Federico Ardanaz, Joel Emer, Stephen Felix, Julio Gago, Roger Gramunt, Isaac 
Hernandez, Toni Juan, Geoff Lowney, Matthew Mattina, Andre Seznec 
May 2002 ACM SIGARCH Computer Architecture News, volume 30 issue 2
Full text available: pdf, Publisher Site Additional Information: full citation, abstract, references, citings, index terms

Tarantula is an aggressive floating point machine targeted at technical, scientific and
bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor
[6, 5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per
cycle. The vector unit fetches data directly from a 16 MByte second level cache with a peak
bandwidth of sixty four 64-bit values per cycle. The whole chip is backed by a memory
controller capable of delivering over 64 GBytes/s ... 

Keywords: Vector Processor, Microprocessor, High Performance, Bandwidth, Power, 
Instruction Set Architecture, Virtual Memory, Cache Coherency 



6 The hardware architecture of the CRISP microprocessor
D. R. Ditzel, H. R. McLellan, A. D. Berenbaum

June 1987 Proceedings of the 14th annual international symposium on Computer 
architecture 

Full text available: pdf(930.17 KB) Additional Information: full citation, references, citings, index terms



7 Memory Ordering: A Value-Based Approach
Harold W. Cain, Mikko H. Lipasti 

March 2004 ACM SIGARCH Computer Architecture News, Proceedings of the 31st
annual international symposium on Computer architecture - Volume 00,
volume 32 issue 2
Full text available: pdf(244.36 KB) Additional Information: full citation, abstract

Conventional out-of-order processors employ a multi-ported, fully-associative load queue to
guarantee correct memory reference order both within a single thread of execution and
across threads in a multiprocessor system. As improvements in process technology and
pipelining lead to higher clock frequencies, scaling this complex structure to accommodate a
larger number of in-flight loads becomes difficult, if not impossible. Furthermore, each access
to this complex structure consumes excessive amounts of e ...

8 Half-price architecture

Ilhyun Kim, Mikko H. Lipasti 

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture, volume 31 issue 2
Full text available: pdf(278.61 KB) Additional Information: full citation, abstract, references

Current-generation microprocessors are designed to process instructions with one and two 
source operands at equal cost. Handling two source operands requires multiple ports for 
each instruction in structures, such as the register file and wakeup logic, which are often in
the processor's critical timing paths. We argue that these structures are overdesigned since 
only a small fraction of instructions require two source operands to be processed 
simultaneously. In this paper, we propose the half-pr ... 

9 Remote queues: exposing message queues for optimization and atomicity
Eric A. Brewer, Frederic T. Chong, Lok T. Liu, Shamik D. Sharma, John D. Kubiatowicz 
July 1995 Proceedings of the seventh annual ACM symposium on Parallel algorithms 
and architectures 

Full text available: pdf(1.78 MB) Additional Information: full citation, references, citings, index terms



10 Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data
Canturk Isci, Margaret Martonosi

December 2003 Proceedings of the 36th Annual IEEE/ACM International Symposium on 
Microarchitecture 

Full text available: pdf(921.50 KB) Additional Information: full citation, abstract, citings, index terms

With power dissipation becoming an increasingly vexing problem across many classes of
computer systems, measuring power dissipation of real, running systems has become crucial
for hardware and software system research and design. Live power measurements are
imperative for studies requiring execution times too long for simulation, such as thermal
analysis. Furthermore, as processors become more complex and include a host of aggressive
dynamic power management techniques, per-component estimates of power d ...

11 Accelerating shared virtual memory via general-purpose network interface support
Angelos Bilas, Dongming Jiang, Jaswinder Pal Singh 

February 2001 ACM Transactions on Computer Systems (TOCS), volume 19 issue 1
Full text available: pdf Additional Information: full citation, abstract, references, index terms, review

Clusters of symmetric multiprocessors (SMPs) are important platforms for high-performance 
computing. With the success of hardware cache-coherent distributed shared memory 
(DSM), a lot of effort has also been made to support the coherent shared-address-space
programming model in software on clusters. Much research has been done in fast 
communication on clusters and in protocols for supporting software shared memory across 
them. However, the performance of software virtual memory (SVM) is sti ... 

Keywords: applications, clusters, shared virtual memory, system area networks 



12 Decoupled hardware support for distributed shared memory
Steven K. Reinhardt, Robert W. Pfile, David A. Wood 

May 1996 ACM SIGARCH Computer Architecture News, Proceedings of the 23rd
annual international symposium on Computer architecture, volume 24 issue 2
Full text available: pdf(1.47 MB) Additional Information: full citation, abstract, references, citings, index terms

This paper investigates hardware support for fine-grain distributed shared memory (DSM)
in networks of workstations. To reduce design time and implementation cost relative to
dedicated DSM systems, we decouple the functional hardware components of DSM support,
allowing greater use of off-the-shelf devices. We present two decoupled systems, Typhoon-0
and Typhoon-1. Typhoon-0 uses an off-the-shelf protocol processor and network interface;
a custom access control device is the only DSM-specific hard ...

13 The Clipper processor: instruction set architecture and implementation
W. Hollingsworth, H. Sachs, A. J. Smith 
February 1989 Communications of the ACM, volume 32 issue 2 

Full text available: pdf(4.67 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Intergraph's CLIPPER microprocessor is a high performance, three chip module that 
implements a new instruction set architecture designed for convenient programmability,
broad functionality, and easy future expansion. 

14 SPLASH: Stanford parallel applications for shared-memory

Jaswinder Pal Singh, Wolf-Dietrich Weber, Anoop Gupta 

March 1992 ACM SIGARCH Computer Architecture News, volume 20 issue 1
Full text available: pdf(3.04 MB) Additional Information: full citation, abstract, citings, index terms

We present the Stanford Parallel Applications for Shared-Memory (SPLASH), a set of parallel 
applications for use in the design and evaluation of shared-memory multiprocessing 
systems. Our goal is to provide a suite of realistic applications that will serve as a
well-documented and consistent basis for evaluation studies. We describe the applications
currently in the suite in detail, discuss some of their important characteristics, and explore 
their behavior by running them on a real multiprocess ... 

15 Synchronization and communication in the T3E multiprocessor
Steven L. Scott 

September 1996 Proceedings of the seventh international conference on Architectural
support for programming languages and operating systems, volume 31, 30 issue 9, 5
Full text available: pdf(1.34 MB) Additional Information: full citation, abstract, references, citings, index terms

This paper describes the synchronization and communication primitives of the Cray T3E
multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we 
have learned from the T3D project (the predecessor to the T3E) and the rationale behind 
changes made for the T3E. We include performance measurements for various aspects of 
communication and synchronization. The T3E augments the memory interface of the DEC
21164 microprocessor with a large set of explicitly-managed, external r ... 

16 Cache Memories
Alan Jay Smith 

September 1982 ACM Computing Surveys (CSUR), volume 14 issue 3
Full text available: pdf(4.61 MB) Additional Information: full citation, references, citings, index terms



17 A system for authenticated policy-compliant routing
Barath Raghavan, Alex C. Snoeren 

August 2004 ACM SIGCOMM Computer Communication Review , Proceedings of the 
2004 conference on Applications, technologies, architectures, and 
protocols for computer communications, volume 34 issue 4 

Full text available: pdf(219.77 KB) Additional Information: full citation, abstract, references, index terms

Internet end users and ISPs alike have little control over how packets are routed outside of 
their own AS, restricting their ability to achieve levels of performance, reliability, and utility 
that might otherwise be attained. While researchers have proposed a number of
source-routing techniques to combat this limitation, there has thus far been no way for
independent ASes to ensure that such traffic does not circumvent local traffic policies, nor 
to accurately determine the correct party to char ... 

Keywords: authentication, capabilities, overlay networks, source routing 



18 Analysis of multiprocessor cache organizations with alternative main memory update
policies

W. C. Yen, K. S. Fu 

May 1981 Proceedings of the 8th annual symposium on Computer Architecture 

Full text available: pdf(1.04 MB) Additional Information: full citation, abstract, references, citings, index terms

Cache memory has played a significant role in the memory hierarchy and has been used 
extensively in large systems and minisystems. The effectiveness of cache memories with 
alternative main memory update policies in a multiprocessor system is a major concern in 
this paper. The performances of write-through with write-allocation or no-write allocation, 
buffered write-through, flag-swap, and buffered flag-swap policies have been analyzed. 
Because of the dominating cost of the interface between ... 

19 Illustrative risks to the public in the use of computer systems and related technology
Peter G. Neumann 

January 1996 ACM SIGSOFT Software Engineering Notes, volume 21 issue 1
Full text available: pdf(2.54 MB) Additional Information: full citation



20 A "flight data recorder" for enabling full-system multiprocessor deterministic replay
Min Xu, Rastislav Bodik, Mark D. Hill 

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture, volume 31 issue 2
Full text available: pdf(311.95 KB) Additional Information: full citation, abstract, references

Debuggers have been proven indispensable in improving software reliability. Unfortunately,
on most real-life software, debuggers fail to deliver their most essential feature: a
faithful replay of the execution. The reason is non-determinism caused by multithreading 
and non-repeatable inputs. A common solution to faithful replay has been to record the 
non-deterministic execution. Existing recorders, however, either work only for datarace-free 
programs or have prohibitive overhead. As a step tow ...